Popularity and average top 10% percentile count over time¶
| | column_name | data_type |
|---|---|---|
| 0 | rating | integer |
| 1 | created_at_dt | timestamp with time zone |
| 2 | created_at | text |
| 3 | author_id | text |
| 4 | content | text |
| 5 | title | text |
| 6 | podcast_id | text |
| | year_month | total_reviews | avg_reviews_per_podcast |
|---|---|---|---|
| 0 | 2023-02-01 00:00:00+02:00 | 4818 | 3 |
| 1 | 2023-01-01 00:00:00+02:00 | 13575 | 4 |
| 2 | 2022-12-01 00:00:00+02:00 | 12704 | 3 |
| 3 | 2022-11-01 00:00:00+02:00 | 14267 | 3 |
| 4 | 2022-10-01 00:00:00+03:00 | 16699 | 3 |
| ... | ... | ... | ... |
| 202 | 2006-04-01 00:00:00+03:00 | 271 | 1 |
| 203 | 2006-03-01 00:00:00+02:00 | 294 | 2 |
| 204 | 2006-02-01 00:00:00+02:00 | 330 | 2 |
| 205 | 2006-01-01 00:00:00+02:00 | 322 | 2 |
| 206 | 2005-12-01 00:00:00+02:00 | 208 | 2 |
207 rows × 3 columns
| | year_month | total_reviews | avg_reviews_per_podcast | percentile_10 | percentile_50 | percentile_90 |
|---|---|---|---|---|---|---|
| 0 | 2023-02-01 00:00:00+02:00 | 4818 | 3 | 1.0 | 2.0 | 7.0 |
| 1 | 2023-01-01 00:00:00+02:00 | 13575 | 4 | 1.0 | 2.0 | 8.0 |
| 2 | 2022-12-01 00:00:00+02:00 | 12704 | 3 | 1.0 | 1.0 | 7.0 |
| 3 | 2022-11-01 00:00:00+02:00 | 14267 | 3 | 1.0 | 1.0 | 7.0 |
| 4 | 2022-10-01 00:00:00+03:00 | 16699 | 3 | 1.0 | 1.0 | 7.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 202 | 2006-04-01 00:00:00+03:00 | 271 | 1 | 1.0 | 1.0 | 4.0 |
| 203 | 2006-03-01 00:00:00+02:00 | 294 | 2 | 1.0 | 1.0 | 4.0 |
| 204 | 2006-02-01 00:00:00+02:00 | 330 | 2 | 1.0 | 1.0 | 3.0 |
| 205 | 2006-01-01 00:00:00+02:00 | 322 | 2 | 1.0 | 1.0 | 4.0 |
| 206 | 2005-12-01 00:00:00+02:00 | 208 | 2 | 1.0 | 1.0 | 5.0 |
207 rows × 6 columns
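A table of this shape could be produced with a pandas groupby over month and podcast. The following is a minimal sketch, not the notebook's actual query: the function name `monthly_review_stats` and the toy data are hypothetical, and only the `podcast_id` and `created_at_dt` column names are taken from the `reviews` schema. For simplicity the sketch formats `year_month` as a `YYYY-MM` string rather than a timestamp.

```python
import pandas as pd


def monthly_review_stats(reviews: pd.DataFrame) -> pd.DataFrame:
    """Per-month review totals plus percentiles of reviews per podcast."""
    df = reviews.copy()
    # Bucket every review into its calendar month (string key, e.g. "2023-01").
    df["year_month"] = pd.to_datetime(df["created_at_dt"]).dt.strftime("%Y-%m")
    # Number of reviews each podcast received in each month.
    per_podcast = (
        df.groupby(["year_month", "podcast_id"])
        .size()
        .rename("n_reviews")
        .reset_index()
    )
    grouped = per_podcast.groupby("year_month")["n_reviews"]
    stats = pd.DataFrame(
        {
            "total_reviews": grouped.sum(),
            "avg_reviews_per_podcast": grouped.mean(),
            "percentile_10": grouped.quantile(0.10),
            "percentile_50": grouped.quantile(0.50),
            "percentile_90": grouped.quantile(0.90),
        }
    )
    # Newest month first, matching the ordering of the table above.
    return stats.sort_index(ascending=False).reset_index()


# Hypothetical toy data: podcast "a" gets 3 reviews in January and 2 in February,
# podcast "b" gets 1 review in January.
toy_reviews = pd.DataFrame(
    {
        "podcast_id": ["a", "a", "a", "b", "a", "a"],
        "created_at_dt": pd.to_datetime(
            ["2023-01-02", "2023-01-05", "2023-01-20",
             "2023-01-09", "2023-02-01", "2023-02-03"]
        ),
    }
)
toy_stats = monthly_review_stats(toy_reviews)
print(toy_stats)
```

The percentiles are computed over podcasts that received at least one review in the month, which is why low percentiles tend to sit at 1.0.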
We can see that the share of reviews for the top 5% and especially top 1% of podcasts started increasing significantly after 2018. This implies that the investments were somewhat worthwhile.
Some questions we need to consider: If Spotify's investment in podcasts (both specific and overall infrastructure) resulted in significant user growth, did most of these users:
- Disproportionately listen to these most expensive/most popular podcasts?
- If so, did these new users later engage with other podcasts as well?
- Were the new users retained over a longer period, or was there a significant drop-off?
Did most of the growth go to the top 1% of podcasts?¶
Hypothesis I¶
Did Spotify's investment and overall strategy of focusing on a small number of creators prove effective? Specifically, did the popularity of the most popular podcasts (defined as the top 1% by number of reviews) grow faster than that of other podcasts? Based on this question, we formulate our first hypothesis:
H1: The number of reviews for the most popular podcasts is increasing at a faster rate than for the bottom 99% of all podcasts.
To test this hypothesis, follow these steps:
- Transform the `reviews_by_month_count_df_after_2015` dataframe to show the monthly growth rate for the top 1% and bottom 99% of podcasts.
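The transformation described above might look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: the function name `monthly_growth_by_group`, the `top_share` parameter, and the input layout (one row per `podcast_id` and `year_month` with a `review_count` column) are hypothetical, and the growth rate is assumed to be the month-over-month percentage change in total reviews per group.

```python
import numpy as np
import pandas as pd


def monthly_growth_by_group(monthly: pd.DataFrame, top_share: float = 0.01) -> pd.DataFrame:
    """Month-over-month growth in review counts for top vs. bottom podcasts."""
    # Rank podcasts by their total number of reviews over the whole period.
    totals = monthly.groupby("podcast_id")["review_count"].sum()
    # Podcasts at or above the (1 - top_share) quantile form the "top" group.
    cutoff = totals.quantile(1 - top_share)
    top_ids = totals[totals >= cutoff].index
    group = np.where(monthly["podcast_id"].isin(top_ids), "top", "bottom")
    # Total reviews per month and group, months in chronological order.
    counts = (
        monthly.assign(group=group)
        .pivot_table(
            index="year_month",
            columns="group",
            values="review_count",
            aggfunc="sum",
            fill_value=0,
        )
        .sort_index()
    )
    # Growth rate: (this month - previous month) / previous month.
    return counts.pct_change(fill_method=None).dropna()


# Hypothetical toy data with two podcasts; top_share=0.5 so one lands in each group.
toy_monthly = pd.DataFrame(
    {
        "podcast_id": ["p1", "p1", "p2", "p2"],
        "year_month": ["2020-01", "2020-02", "2020-01", "2020-02"],
        "review_count": [10, 20, 4, 5],
    }
)
toy_growth = monthly_growth_by_group(toy_monthly, top_share=0.5)
print(toy_growth)
```

A month with zero reviews in the previous period would produce an infinite growth rate, so real data may need filtering before the statistics are computed.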
| | count |
|---|---|
| 0 | 1981420 |
Growth rate top 1%: mean:0.08 (stdev: 0.44)
Growth bottom 99%: mean:0.02 (stdev: 0.14)
We need to decide whether to use a parametric or a non-parametric test. A parametric test such as the t-test is appropriate if:
- The data is normally distributed and/or the sample size is very large.
- The variances of the two groups being compared are equal (homogeneity of variances).
If these assumptions do not hold, we should use a non-parametric test such as the Mann-Whitney U test.
Shapiro-Wilk test is used to test for normality, result:
Test stat: 0.76
P value: 0.0 (If p-value < 0.05, the data is not normally distributed)
This indicates that the growth rates are not normally distributed, so we can't use the t-test, which assumes a normal distribution. Instead, we can use the Mann-Whitney U test.
Levene test for Homogeneity of Variances result:
Test stat: 40.53
P value: 0.0
This indicates that the growth rates of the two groups do not have homogeneous variances, which concurs with our decision to use a non-parametric test.
Mann-Whitney U test result:
U value: 40.53
P value: 0.0
The p-value, reported as 0.0, lets us reject the null hypothesis of no difference, indicating a significant difference in growth rates between the top 1% and bottom 99% of podcasts. This supports the original hypothesis. Comparing average growth rates indicates which group is growing faster:
Average growth for top 1%: 0.08
Average growth for the bottom 99%: 0.02
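The test-selection procedure above can be sketched with `scipy.stats`. This is a minimal illustration under assumptions rather than the notebook's code: the helper name `compare_growth`, the `alpha` threshold of 0.05, and the synthetic samples are hypothetical.

```python
import numpy as np
from scipy import stats


def compare_growth(top, bottom, alpha=0.05):
    """Pick a t-test or Mann-Whitney U test based on assumption checks."""
    top = np.asarray(top, dtype=float)
    bottom = np.asarray(bottom, dtype=float)
    # Shapiro-Wilk: a p-value below alpha suggests a sample is not normal.
    normal = all(stats.shapiro(x).pvalue >= alpha for x in (top, bottom))
    # Levene: a p-value below alpha suggests the group variances differ.
    equal_var = stats.levene(top, bottom).pvalue >= alpha
    if normal and equal_var:
        name = "t-test"
        result = stats.ttest_ind(top, bottom)
    else:
        name = "mann-whitney-u"
        result = stats.mannwhitneyu(top, bottom, alternative="two-sided")
    return {
        "test": name,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }


# Synthetic samples (hypothetical): a skewed group and a roughly normal group.
rng = np.random.default_rng(0)
demo = compare_growth(rng.exponential(size=200), rng.normal(size=200))
print(demo)
```

A two-sided test only establishes that the groups differ, so the direction still has to be read from the group means or medians, as done above.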
Distribution of Podcasts by Popularity:¶
Gini coefficient is: 0.9302907076746617
We can further see that reviews are distributed extremely unevenly across podcasts. Specifically, the top 1% of all podcasts by review count account for 57% of all reviews.
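For reference, a Gini coefficient and a top-share figure of this kind can be computed from a list of per-podcast review counts. The sketch below uses the standard sorted-rank formula and plain Python. The function names `gini` and `top_share` and the toy counts are hypothetical, and the notebook may have used a different implementation.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    # assigned to the values in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(values, fraction=0.01):
    """Share of the total held by the largest `fraction` of items."""
    xs = sorted(values, reverse=True)
    k = max(1, int(len(xs) * fraction))
    return sum(xs[:k]) / sum(xs)


# Hypothetical review counts: one podcast with 91 reviews, nine with 1 each.
toy_counts = [91] + [1] * 9
print(gini(toy_counts), top_share(toy_counts, fraction=0.1))
```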
Unique Podcasts Reviewed per User¶
Podcast Genre Analysis¶
| | category | podcast_id | rating | content_length | title_length | review_count |
|---|---|---|---|---|---|---|
| 0 | true-crime | bf5bf76d5b6ffbf9a31bba4480383b7f | 4.353402 | 265.356595 | 12.0 | 31010 |
| 1 | true-crime | bc5ddad3898e0973eb541577d1df8004 | 3.686242 | 305.848232 | 61.0 | 9587 |
| 2 | comedy | bc5ddad3898e0973eb541577d1df8004 | 3.686242 | 305.848232 | 61.0 | 9587 |
| 3 | news | f5fce0325ac6a4bf5e191d6608b95797 | 3.839367 | 219.777151 | 20.0 | 7265 |
| 4 | true-crime | b1a3eb2aa8e82ecbe9c91ed9a963c362 | 4.247432 | 249.893554 | 19.0 | 6717 |
| ... | ... | ... | ... | ... | ... | ... |
| 212367 | leisure-animation-manga | f0e247111c8985e0c5e14cc8d6442f09 | 5.000000 | 173.000000 | 22.0 | 1 |
| 212368 | leisure | d0bc8c5bf6f0f1eeda8d5c1c8b38adc9 | 5.000000 | 119.000000 | 17.0 | 1 |
| 212369 | leisure-animation-manga | f1bf522813566465708ba99c92813c84 | 5.000000 | 481.000000 | 17.0 | 1 |
| 212370 | judaism | f564e91cdf68e9c51a40fc38b73da7b6 | 5.000000 | 228.000000 | 24.0 | 1 |
| 212371 | history | c3f0fe1ab04701f43cc02fa0316d23cf | 5.000000 | 27.000000 | 17.0 | 1 |
212372 rows × 6 columns
Total Unique Podcasts 110024
| | category | review_count | unique_podcasts |
|---|---|---|---|
| 79 | society-culture | 329054 | 13731 |
| 16 | comedy | 306950 | 11864 |
| 103 | true-crime | 154221 | 1264 |
| 20 | education | 145413 | 8827 |
| 68 | religion-spirituality | 141541 | 12095 |
| 104 | tv-film | 133763 | 6469 |
| 8 | business | 116883 | 8072 |
| 86 | sports | 113116 | 7266 |
| 59 | news | 103378 | 4297 |
| 30 | health-fitness | 96948 | 6050 |
| 15 | christianity | 84668 | 7954 |
| 0 | arts | 84494 | 6078 |
| 41 | kids-family | 66247 | 2383 |
| 38 | history | 57816 | 1663 |
| 46 | leisure | 54142 | 4178 |
| | top_level_category | review_count | unique_podcasts |
|---|---|---|---|
| 19 | society | 437998 | 19441 |
| 4 | comedy | 333317 | 12803 |
| 2 | business | 223394 | 12931 |
| 5 | education | 217351 | 13005 |
| 8 | health | 184358 | 8731 |
| 21 | sports | 184103 | 9280 |
| 16 | news | 179549 | 6606 |
| 24 | tv | 168752 | 8285 |
| 23 | true-crime | 154221 | 1264 |
| 17 | religion | 143055 | 12246 |
| 0 | arts | 141137 | 9675 |
| 14 | leisure | 99622 | 7011 |
| 13 | kids | 88307 | 3321 |
| 3 | christianity | 84668 | 7954 |
| 15 | music | 60633 | 6161 |
Index(['category', 'podcast_id', 'rating', 'content_length', 'title_length',
'review_count', 'top_level_category'],
dtype='object')
array(['society', 'comedy', 'business', 'education', 'health', 'sports',
'news', 'tv', 'true-crime', 'religion', 'arts', 'leisure', 'kids',
'christianity', 'music'], dtype=object)
array(['true-crime', 'comedy', 'news', 'society', 'kids', 'education',
'religion', 'sports', 'tv', 'health', 'business', 'music', 'arts',
'christianity', 'leisure'], dtype=object)
Hypothesis II¶
We can see that the proportion of reviews belonging to the top 1% of podcasts varies widely between genres. Based on this, we can check in which categories the proportion of reviews belonging to the top 1% increased the most.
| | category | podcast_id | year_month | review_count | top_level_category |
|---|---|---|---|---|---|
| 0 | business | a00018b54eb342567c94dacfb2a3e504 | 2017-10-31 | 1 | business |
| 1 | christianity | a00043d34e734b09246d17dc5d56f63c | 2019-09-30 | 1 | christianity |
| 2 | religion-spirituality | a00043d34e734b09246d17dc5d56f63c | 2019-09-30 | 1 | religion |
| 3 | religion-spirituality | a0004b1ef445af9dc84dad1e7821b1e3 | 2011-08-31 | 1 | religion |
| 4 | spirituality | a0004b1ef445af9dc84dad1e7821b1e3 | 2011-08-31 | 1 | spirituality |
| ... | ... | ... | ... | ... | ... |
| 1247729 | news | ffff32caeedd6254573ad1cc49852595 | 2018-02-28 | 1 | news |
| 1247745 | arts | ffff5db4b5db2d860c49749e5de8a36d | 2011-05-31 | 1 | arts |
| 1247759 | comedy | ffff66f98c1adfc8d0d6c41bb8facfd0 | 2018-09-30 | 4 | comedy |
| 1247761 | education | ffff923482740bc21a0fe184865ec2e2 | 2018-04-30 | 1 | education |
| 1247763 | comedy | ffffbd44ec5f79d502f16ae372bf2d4f | 2021-08-31 | 1 | comedy |
151349 rows × 5 columns
Index(['category', 'podcast_id', 'year_month', 'review_count',
'top_level_category'],
dtype='object')
| | top_level_category | post_cutoff | is_top_1_percent | review_count | total | prop_of_all_reviews |
|---|---|---|---|---|---|---|
| 1 | arts | False | True | 3574 | 16557 | 0.215860 |
| 3 | arts | True | True | 2285 | 9112 | 0.250768 |
| 5 | buddhism | False | True | 47 | 184 | 0.255435 |
| 8 | business | False | True | 9377 | 33327 | 0.281363 |
| 10 | business | True | True | 4451 | 19714 | 0.225779 |
| 12 | christianity | False | True | 1757 | 10361 | 0.169578 |
| 14 | christianity | True | True | 2201 | 7094 | 0.310262 |
| 16 | comedy | False | True | 8684 | 30434 | 0.285339 |
| 18 | comedy | True | True | 5584 | 15611 | 0.357696 |
| 20 | education | False | True | 6855 | 24218 | 0.283054 |
| 22 | education | True | True | 3860 | 19103 | 0.202063 |
| 24 | fiction | False | True | 841 | 2451 | 0.343125 |
| 26 | fiction | True | True | 1084 | 3467 | 0.312662 |
| 28 | government | False | True | 594 | 1936 | 0.306818 |
| 30 | government | True | True | 127 | 857 | 0.148191 |
| 32 | health | False | True | 4417 | 17304 | 0.255259 |
| 34 | health | True | True | 2611 | 14157 | 0.184432 |
| 36 | hinduism | False | True | 12 | 34 | 0.352941 |
| 39 | history | False | True | 1541 | 4142 | 0.372042 |
| 41 | history | True | True | 393 | 2545 | 0.154420 |
| 43 | islam | False | True | 23 | 257 | 0.089494 |
| 45 | islam | True | True | 33 | 124 | 0.266129 |
| 47 | judaism | False | True | 40 | 246 | 0.162602 |
| 49 | judaism | True | True | 28 | 258 | 0.108527 |
| 51 | kids | False | True | 1808 | 7357 | 0.245752 |
| 53 | kids | True | True | 1572 | 5565 | 0.282480 |
| 55 | leisure | False | True | 3168 | 10201 | 0.310558 |
| 57 | leisure | True | True | 1052 | 6776 | 0.155254 |
| 59 | music | False | True | 2106 | 8978 | 0.234573 |
| 61 | music | True | True | 1551 | 4620 | 0.335714 |
| 63 | news | False | True | 3627 | 11611 | 0.312376 |
| 65 | news | True | True | 3560 | 10340 | 0.344294 |
| 67 | religion | False | True | 3255 | 17109 | 0.190251 |
| 69 | religion | True | True | 2993 | 10715 | 0.279328 |
| 71 | science | False | True | 1011 | 3372 | 0.299822 |
| 73 | science | True | True | 294 | 2073 | 0.141823 |
| 75 | society | False | True | 13055 | 41444 | 0.315003 |
| 77 | society | True | True | 6256 | 26072 | 0.239951 |
| 79 | spirituality | False | True | 1083 | 4115 | 0.263183 |
| 81 | spirituality | True | True | 688 | 2726 | 0.252384 |
| 83 | sports | False | True | 5028 | 16407 | 0.306455 |
| 85 | sports | True | True | 2759 | 11894 | 0.231966 |
| 87 | technology | False | True | 1646 | 6719 | 0.244977 |
| 89 | technology | True | True | 445 | 1915 | 0.232376 |
| 91 | true-crime | False | True | 2325 | 5044 | 0.460944 |
| 93 | true-crime | True | True | 911 | 5503 | 0.165546 |
| 95 | tv | False | True | 4485 | 16644 | 0.269466 |
| 97 | tv | True | True | 2837 | 8515 | 0.333177 |
| | top_level_category | pre_cutoff_ratio | post_cutoff_ratio | pre_cutoff_review_count | post_cutoff_review_count | relative_change_in_ratio | sum_review_count |
|---|---|---|---|---|---|---|---|
| 0 | arts | 0.215860 | 0.250768 | 3574.0 | 2285.0 | 0.161715 | 5859.0 |
| 2 | business | 0.281363 | 0.225779 | 9377.0 | 4451.0 | -0.197555 | 13828.0 |
| 3 | christianity | 0.169578 | 0.310262 | 1757.0 | 2201.0 | 0.829611 | 3958.0 |
| 4 | comedy | 0.285339 | 0.357696 | 8684.0 | 5584.0 | 0.253585 | 14268.0 |
| 5 | education | 0.283054 | 0.202063 | 6855.0 | 3860.0 | -0.286134 | 10715.0 |
| 6 | fiction | 0.343125 | 0.312662 | 841.0 | 1084.0 | -0.088781 | 1925.0 |
| 7 | government | 0.306818 | 0.148191 | 594.0 | 127.0 | -0.517006 | 721.0 |
| 8 | health | 0.255259 | 0.184432 | 4417.0 | 2611.0 | -0.277472 | 7028.0 |
| 10 | history | 0.372042 | 0.154420 | 1541.0 | 393.0 | -0.584939 | 1934.0 |
| 11 | islam | 0.089494 | 0.266129 | 23.0 | 33.0 | 1.973703 | 56.0 |
| 12 | judaism | 0.162602 | 0.108527 | 40.0 | 28.0 | -0.332558 | 68.0 |
| 13 | kids | 0.245752 | 0.282480 | 1808.0 | 1572.0 | 0.149449 | 3380.0 |
| 14 | leisure | 0.310558 | 0.155254 | 3168.0 | 1052.0 | -0.500081 | 4220.0 |
| 15 | music | 0.234573 | 0.335714 | 2106.0 | 1551.0 | 0.431169 | 3657.0 |
| 16 | news | 0.312376 | 0.344294 | 3627.0 | 3560.0 | 0.102177 | 7187.0 |
| 17 | religion | 0.190251 | 0.279328 | 3255.0 | 2993.0 | 0.468210 | 6248.0 |
| 18 | science | 0.299822 | 0.141823 | 1011.0 | 294.0 | -0.526975 | 1305.0 |
| 19 | society | 0.315003 | 0.239951 | 13055.0 | 6256.0 | -0.238259 | 19311.0 |
| 20 | spirituality | 0.263183 | 0.252384 | 1083.0 | 688.0 | -0.041032 | 1771.0 |
| 21 | sports | 0.306455 | 0.231966 | 5028.0 | 2759.0 | -0.243067 | 7787.0 |
| 22 | technology | 0.244977 | 0.232376 | 1646.0 | 445.0 | -0.051437 | 2091.0 |
| 23 | true-crime | 0.460944 | 0.165546 | 2325.0 | 911.0 | -0.640854 | 3236.0 |
| 24 | tv | 0.269466 | 0.333177 | 4485.0 | 2837.0 | 0.236431 | 7322.0 |
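The `relative_change_in_ratio` column above is consistent with `(post_cutoff_ratio - pre_cutoff_ratio) / pre_cutoff_ratio`. The sketch below shows one way to derive it. It is an illustration only: the function name `relative_change_in_top_share` and the intermediate `period` column are hypothetical, and the toy frame reuses the review counts of the `arts` rows from the tables above.

```python
import pandas as pd


def relative_change_in_top_share(df: pd.DataFrame) -> pd.DataFrame:
    """Relative change in the top-1% share of reviews, pre vs. post cutoff."""
    # Label each row with a readable period name instead of a boolean.
    df = df.assign(period=df["post_cutoff"].map({False: "pre", True: "post"}))
    keys = ["top_level_category", "period"]
    # All reviews per category and period, and those going to top-1% podcasts.
    totals = df.groupby(keys)["review_count"].sum()
    top = df[df["is_top_1_percent"]].groupby(keys)["review_count"].sum()
    ratio = (top / totals).unstack("period")
    out = pd.DataFrame(
        {
            "pre_cutoff_ratio": ratio["pre"],
            "post_cutoff_ratio": ratio["post"],
        }
    )
    out["relative_change_in_ratio"] = (
        out["post_cutoff_ratio"] - out["pre_cutoff_ratio"]
    ) / out["pre_cutoff_ratio"]
    # Categories missing either period cannot be compared and are dropped.
    return out.dropna().reset_index()


# "arts" counts from the tables above: 3574 of 16557 reviews before the cutoff
# and 2285 of 9112 after it went to top-1% podcasts.
toy_share = pd.DataFrame(
    {
        "top_level_category": ["arts"] * 4,
        "post_cutoff": [False, False, True, True],
        "is_top_1_percent": [True, False, True, False],
        "review_count": [3574, 16557 - 3574, 2285, 9112 - 2285],
    }
)
arts_change = relative_change_in_top_share(toy_share)
print(arts_change)
```

Because the change is relative, categories with very few top-1% reviews before the cutoff (such as `islam`, with 23) can show large swings from small absolute differences.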
| | top_level_category | year_month_first | prop_top_1_percent_first | year_month_last | prop_top_1_percent_last | prop_change |
|---|---|---|---|---|---|---|
| 15 | music | 2005-11-30 | 0.0 | 2022-12-31 | 0.891667 | 0.891667 |
| 19 | society | 2005-11-30 | 0.0 | 2022-12-31 | 0.805195 | 0.805195 |
| 8 | health | 2005-11-30 | 0.0 | 2022-12-31 | 0.365854 | 0.365854 |
| 0 | arts | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 13 | kids | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 23 | true-crime | 2015-06-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 22 | technology | 2005-11-30 | 0.0 | 2022-10-31 | 0.000000 | 0.000000 |
| 21 | sports | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 20 | spirituality | 2005-11-30 | 0.0 | 2022-11-30 | 0.000000 | 0.000000 |
| 18 | science | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 17 | religion | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 16 | news | 2005-11-30 | 0.0 | 2022-11-30 | 0.000000 | 0.000000 |
| 14 | leisure | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 12 | judaism | 2005-12-31 | 0.0 | 2022-08-31 | 0.000000 | 0.000000 |
| 1 | buddhism | 2005-11-30 | 0.0 | 2022-05-31 | 0.000000 | 0.000000 |
| 11 | islam | 2005-12-31 | 0.0 | 2022-08-31 | 0.000000 | 0.000000 |
| 10 | history | 2005-11-30 | 0.0 | 2022-11-30 | 0.000000 | 0.000000 |
| 9 | hinduism | 2006-11-30 | 0.0 | 2021-09-30 | 0.000000 | 0.000000 |
| 7 | government | 2005-11-30 | 0.0 | 2022-10-31 | 0.000000 | 0.000000 |
| 6 | fiction | 2005-12-31 | 0.0 | 2022-11-30 | 0.000000 | 0.000000 |
| 5 | education | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 4 | comedy | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 3 | christianity | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 2 | business | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
| 24 | tv | 2005-11-30 | 0.0 | 2022-12-31 | 0.000000 | 0.000000 |
'runs'
| | column_name | data_type |
|---|---|---|
| 0 | run_at | text |
| 1 | max_rowid | integer |
| 2 | reviews_added | integer |
'podcasts'
| | column_name | data_type |
|---|---|---|
| 0 | podcast_id | text |
| 1 | itunes_id | integer |
| 2 | slug | text |
| 3 | itunes_url | text |
| 4 | title | text |
'categories'
| | column_name | data_type |
|---|---|---|
| 0 | podcast_id | text |
| 1 | category | text |
'reviews'
| | column_name | data_type |
|---|---|---|
| 0 | author_id | text |
| 1 | podcast_id | text |
| 2 | created_at | text |
| 3 | title | text |
| 4 | content | text |
| 5 | rating | integer |
| 6 | created_at_dt | timestamp with time zone |